The automotive trends dataset tracks vehicle attributes such as weight and horsepower, along with fuel economy, CO2 emissions, production share, and other factors.
The data for this analysis comes from the Automotive Trends dataset published by the United States Environmental Protection Agency (EPA).
Below is the link for the data source:
https://www.epa.gov/automotive-trends/explore-automotive-trends-data#SummaryData
The EPA collects the data through laboratory testing or directly from manufacturers using official EPA test procedures. Since 1975, the EPA has maintained the dataset to provide the public with information about fuel economy, technology data, emissions, and auto manufacturers’ performance in meeting the agency’s greenhouse gas emissions standards, as well as to support national programs.
Cases: each row in this dataset represents the performance of one vehicle type in a given model year.
The variables used in this analysis are listed below.
Model year: The vehicle’s model year, from 1975 to 2022 (the 2022 values are preliminary).
Regulatory class: Whether the vehicle is regulated as a car or a truck.
Vehicle type: The vehicle category within the regulatory class: sedan/wagon, car SUV, truck SUV, pickup, or minivan/van, plus the aggregate rows All, All Car, and All Truck.
Production Share: The vehicle type’s production share in that model year.
Real.World.MPG: The real-world fuel economy in miles per gallon.
Real.World.MPG_City: The real-world miles per gallon in city driving.
Real.World.MPG_Hwy: The real-world miles per gallon in highway driving.
Real.World.CO2..g.mi: The real-world CO2 emissions in grams per mile.
Real.World.CO2_City..g.mi: The real-world CO2 emissions in grams per mile in city driving.
Real.World.CO2_Hwy..g.mi: The real-world CO2 emissions in grams per mile in highway driving.
Weight: The vehicle weight, measured in pounds (lbs).
Horsepower: The engine power, measured in horsepower (hp).
The dataset I have taken is from an observational study.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(readxl)
rawdata <- read.csv("~/Desktop/201/ Automotive_Trends Report_Project.csv")
head(rawdata)
## Model.Year Regulatory.Class Vehicle.Type Production.Share Real.World.MPG
## 1 1975 All All 1 13.05970
## 2 1975 Car All Car 0.806646 13.45483
## 3 1975 Car Sedan/Wagon 0.805645 13.45833
## 4 1975 Truck All Truck 0.193354 11.63431
## 5 1975 Truck Pickup 0.131322 11.91476
## 6 1975 Truck Minivan/Van 0.0447 11.10606
## Real.World.MPG_City Real.World.MPG_Hwy Real.World.CO2..g.mi.
## 1 12.01552 14.61167 680.5961
## 2 12.31413 15.17266 660.6374
## 3 12.31742 15.17643 660.4660
## 4 10.91165 12.65900 763.8613
## 5 11.07827 13.12613 745.8814
## 6 10.55642 11.86084 800.1940
## Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi. Weight..lbs.
## 1 739.7380 608.3116 4060.399
## 2 721.8293 585.8472 4057.494
## 3 721.6367 585.7019 4057.565
## 4 814.4506 702.0300 4072.518
## 5 802.2009 677.0464 4011.977
## 6 841.8573 749.2722 4195.690
## Horsepower..HP. Footprint..sq..ft..
## 1 137.3346 -
## 2 136.1964 -
## 3 136.2256 -
## 4 142.0826 -
## 5 140.9365 -
## 6 143.2245 -
Output: We observed the preceding data for the first six rows.
dim(rawdata)
## [1] 384 13
Output : According to the above observation, there are 384 rows and 13 columns.
tail(rawdata)
## Model.Year Regulatory.Class Vehicle.Type Production.Share Real.World.MPG
## 379 Prelim. 2022 Car Car SUV - 32.38793
## 380 Prelim. 2022 All All - 26.35965
## 381 Prelim. 2022 Truck Minivan/Van - 25.59317
## 382 Prelim. 2022 Truck Truck SUV - 24.75038
## 383 Prelim. 2022 Truck All Truck - 23.40912
## 384 Prelim. 2022 Truck Pickup - 20.06288
## Real.World.MPG_City Real.World.MPG_Hwy Real.World.CO2..g.mi.
## 379 29.25306 35.23655 261.7094
## 380 23.17949 29.40284 330.8116
## 381 22.10621 29.04996 344.2938
## 382 21.90441 27.43990 354.1329
## 383 20.60126 26.09186 375.9269
## 384 17.49366 22.56268 442.4302
## Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi. Weight..lbs.
## 379 291.7420 239.0533 3832.524
## 380 377.1848 295.8283 4328.963
## 381 398.0267 303.7584 4557.279
## 382 400.5455 319.1199 4534.261
## 383 427.5858 336.9561 4713.739
## 384 508.0322 392.9410 5239.220
## Horsepower..HP. Footprint..sq..ft..
## 379 269.6559 47.34132
## 380 272.3535 51.67437
## 381 245.0592 56.21571
## 382 268.1756 50.02365
## 383 284.8583 54.37582
## 384 339.0876 65.91698
Output : We examined the last six rows of data.
is.data.frame() returns TRUE when the object is a data frame. A data frame created by read.csv() is not a tibble, so we convert it with as_tibble() and confirm the result with is_tibble().
is.data.frame(rawdata)
## [1] TRUE
Output: The automotive data is in data frame format.
newdata<- as_tibble(rawdata)
is_tibble(newdata)
## [1] TRUE
Output: Because is_tibble() returned TRUE, newdata is a tibble.
summary(rawdata)
## Model.Year Regulatory.Class Vehicle.Type Production.Share
## Length:384 Length:384 Length:384 Length:384
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Real.World.MPG Real.World.MPG_City Real.World.MPG_Hwy Real.World.CO2..g.mi.
## Min. :10.53 Min. : 9.393 Min. :10.81 Min. :254.0
## 1st Qu.:17.03 1st Qu.:15.001 1st Qu.:19.33 1st Qu.:386.2
## Median :19.38 Median :16.898 Median :22.54 Median :458.9
## Mean :20.00 Mean :17.491 Mean :22.91 Mean :466.6
## 3rd Qu.:23.02 3rd Qu.:19.755 3rd Qu.:27.02 3rd Qu.:522.6
## Max. :33.71 Max. :29.253 Max. :38.19 Max. :844.0
## Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi. Weight..lbs.
## Min. :291.7 Min. :222.7 Min. :2630
## 1st Qu.:449.7 1st Qu.:328.6 1st Qu.:3536
## Median :526.5 Median :395.2 Median :3991
## Mean :528.6 Mean :410.4 Mean :3987
## 3rd Qu.:592.6 3rd Qu.:459.9 3rd Qu.:4415
## Max. :946.2 Max. :822.0 Max. :5485
## Horsepower..HP. Footprint..sq..ft..
## Min. : 87.81 Length:384
## 1st Qu.:138.15 Class :character
## Median :178.48 Mode :character
## Mean :183.10
## 3rd Qu.:215.07
## Max. :345.67
Output : The dataset has 384 observations and 13 variables. Eight of the variables are numeric and the other five are read as character.
colSums(is.na(rawdata))
## Model.Year Regulatory.Class
## 0 0
## Vehicle.Type Production.Share
## 0 0
## Real.World.MPG Real.World.MPG_City
## 0 0
## Real.World.MPG_Hwy Real.World.CO2..g.mi.
## 0 0
## Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi.
## 0 0
## Weight..lbs. Horsepower..HP.
## 0 0
## Footprint..sq..ft..
## 0
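Output: Every count is 0, so no column contains NA values. Note, however, that Production.Share and Footprint..sq..ft.. use "-" as a placeholder for unavailable values (see the head() and tail() output above), which is.na() does not count as missing.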
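Because the raw file marks unavailable values with "-" rather than leaving them blank, the missing-value check can be made more informative by converting those placeholders to NA first. The following is a minimal sketch on a small made-up data frame; the toy values and the helper name dash_to_na are illustrative assumptions, not part of the original analysis.

```r
# Toy data mimicking the "-" placeholder seen in Production.Share and Footprint.
toy <- data.frame(
  Production.Share    = c("1", "0.806646", "-"),
  Footprint..sq..ft.. = c("-", "-", "47.34132"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: treat "-" as missing, then convert the column to numeric.
dash_to_na <- function(x) {
  x <- as.character(x)        # also handles factor columns
  x[trimws(x) == "-"] <- NA
  as.numeric(x)
}

toy_clean <- as.data.frame(lapply(toy, dash_to_na))
na_counts <- colSums(is.na(toy_clean))
print(na_counts)
```

An alternative with the same effect would be to pass na.strings = "-" to read.csv() at import time.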
sum(duplicated(rawdata))
## [1] 0
Output : There are no duplicated rows.
To test whether the mean of Real.World.MPG_City differs significantly from 17.491, a one-sample t-test will be used. The significance level is set at 0.05. The null hypothesis is that the mean city MPG equals 17.491, while the alternative hypothesis is that the mean city MPG does not equal 17.491. Note that 17.491 is the sample mean of this column rounded to three decimals (see the summary above), so the test statistic is expected to be close to zero.
t.test(rawdata$Real.World.MPG_City, mu = 17.491)
##
## One Sample t-test
##
## data: rawdata$Real.World.MPG_City
## t = 0.0003001, df = 383, p-value = 0.9998
## alternative hypothesis: true mean is not equal to 17.491
## 95 percent confidence interval:
## 17.13805 17.84406
## sample estimates:
## mean of x
## 17.49105
Output: We reject the null hypothesis when the p-value is less than 0.05. Here the p-value (0.9998) is greater than alpha, so we fail to reject the null hypothesis that the mean city MPG equals 17.491.
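To make the test statistic less of a black box, the sketch below computes a one-sample t statistic directly from its definition and compares it with t.test(). The data values and the helper name one_sample_t are made up for illustration. It also shows that testing a sample against its own mean gives t = 0, which is why the test above returns a p-value near 1.

```r
# One-sample t statistic from the definition: t = (mean(x) - mu) / (sd(x) / sqrt(n)).
one_sample_t <- function(x, mu) {
  n <- length(x)
  t_stat <- (mean(x) - mu) / (sd(x) / sqrt(n))
  p_two_sided <- 2 * pt(-abs(t_stat), df = n - 1)
  list(t = t_stat, df = n - 1, p = p_two_sided)
}

# Made-up city MPG values, for illustration only.
x <- c(12.0, 15.2, 16.9, 17.5, 19.8, 21.3, 23.1)

manual  <- one_sample_t(x, mu = 17)
builtin <- t.test(x, mu = 17)

# Testing against the sample's own mean: the numerator is zero by construction.
against_own_mean <- one_sample_t(x, mu = mean(x))

print(c(manual_t = manual$t, builtin_t = unname(builtin$statistic)))
print(against_own_mean$p)
```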
We need to determine whether there is a relationship between vehicle weight and horsepower. We use a correlation test to compute the Pearson correlation coefficient between the two variables. The null hypothesis is that the true correlation is zero; if the p-value is less than the significance level, we reject it and conclude that vehicle weight and horsepower are linearly related.
cor.test(rawdata$Weight..lbs., rawdata$Horsepower..HP., method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: rawdata$Weight..lbs. and rawdata$Horsepower..HP.
## t = 21.81, df = 382, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6965565 0.7862007
## sample estimates:
## cor
## 0.7447192
Output: The p-value (< 2.2e-16) is far below the significance level, so we reject the null hypothesis of zero correlation and conclude that vehicle weight and horsepower have a positive linear relationship (r = 0.74, 95% confidence interval 0.70 to 0.79).
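The t statistic reported by cor.test() follows from the correlation coefficient and the degrees of freedom alone. The sketch below reproduces it from the values printed above, assuming only the standard formula t = r * sqrt(n - 2) / sqrt(1 - r^2).

```r
# Values reported by cor.test() above.
r  <- 0.7447192
df <- 382              # n - 2, with n = 384 rows

# t statistic for H0: true correlation = 0.
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)

# Share of the variance in one variable that is linearly associated with the other.
r_squared <- r^2

print(c(t = t_stat, r_squared = r_squared))
```

The squared correlation of roughly 0.55 indicates that a little over half of the variation in horsepower is linearly associated with weight.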
To determine whether the mean of Real.World.CO2_City is less than 400 grams per mile, we will use a one-sided t-test. The null hypothesis states that the mean of Real.World.CO2_City is greater than or equal to 400 grams per mile, while the alternative hypothesis states that the mean of Real.World.CO2_City is less than 400 grams per mile.
t.test(rawdata$Real.World.CO2_City..g.mi., mu = 400 , alternative = "less" )
##
## One Sample t-test
##
## data: rawdata$Real.World.CO2_City..g.mi.
## t = 23.499, df = 383, p-value = 1
## alternative hypothesis: true mean is less than 400
## 95 percent confidence interval:
## -Inf 537.6543
## sample estimates:
## mean of x
## 528.6287
Output : Because the p-value (1) is greater than 0.05, we cannot reject the null hypothesis. There is insufficient evidence that the mean of Real.World.CO2_City is less than 400 grams per mile; the sample mean (528.6 g/mi) is in fact well above 400.
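The direction of the alternative hypothesis decides which tail of the t distribution supplies the p-value. The short sketch below uses the t statistic and degrees of freedom printed above to show both one-sided p-values side by side.

```r
t_stat <- 23.499   # reported by t.test() above
df     <- 383

p_less    <- pt(t_stat, df)                      # H1: mean < 400 (lower tail)
p_greater <- pt(t_stat, df, lower.tail = FALSE)  # H1: mean > 400 (upper tail)

print(c(less = p_less, greater = p_greater))
```

A large positive t statistic therefore gives a p-value near 1 for the "less" alternative and a p-value near 0 for the "greater" alternative.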
H0: mean Real.World.MPG of Car SUV = mean Real.World.MPG of Truck SUV
H1: mean Real.World.MPG of Car SUV ≠ mean Real.World.MPG of Truck SUV
CarSUV <- rawdata$Real.World.MPG[rawdata$Vehicle.Type == "Car SUV"]
TruckSUV <- rawdata$Real.World.MPG[rawdata$Vehicle.Type == "Truck SUV"]
print(M1 <- mean(CarSUV))
## [1] 20.22622
print(M2 <- mean(TruckSUV))
## [1] 17.44931
standard_deviation1 <- sd(CarSUV)
print(paste('Standard Deviation of Car SUV: ', standard_deviation1))
## [1] "Standard Deviation of Car SUV: 4.76605239040631"
standard_deviation2 <- sd(TruckSUV)
print(paste('Standard Deviation of TruckSUV: ', standard_deviation2))
## [1] "Standard Deviation of TruckSUV: 3.43779399551164"
n<-384
StandardError <- sqrt((standard_deviation1^2/n) + (standard_deviation2^2/n))
StandardError
## [1] 0.2998858
A1 <- M1
A2 <- M2
tstat <- (A1-A2)/StandardError
alpha <- 0.05 # two-tailed test, so each tail gets 0.05/2 = 0.025
zscore <- 1.96 # two-tailed critical value at alpha = 0.05 (normal approximation)
tstat
## [1] 9.259889
print(dof <- (n+n)-2)
## [1] 766
print(p_value <- 2 * pt(-abs(tstat), dof))
## [1] 2.014926e-19
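Output : The null hypothesis is rejected because the p-value is smaller than alpha and the t statistic (9.26) exceeds the critical value of 1.96; in a two-tailed test we reject the null hypothesis when the statistic is less than -1.96 or greater than 1.96. We therefore conclude that the mean fuel economy of Car SUVs does not equal that of Truck SUVs. One caveat: n is set to 384, the total number of rows, rather than the number of Car SUV and Truck SUV observations, so the standard error above is understated and the t statistic and degrees of freedom are overstated.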
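The effect of the sample size on this test can be checked from the group summaries alone. The sketch below is a hedged illustration: the helper two_sample_from_summary is hypothetical, and the per-group size of 48 is an assumption (384 rows spread over 8 vehicle types), not a value printed in the report. In practice the sizes would come from length(CarSUV) and length(TruckSUV), or the whole test from t.test(CarSUV, TruckSUV), which uses the Welch approximation by default.

```r
# Standard error, t statistic, Welch degrees of freedom and two-sided p-value
# computed from group means, standard deviations and sizes.
two_sample_from_summary <- function(m1, s1, n1, m2, s2, n2) {
  v1 <- s1^2 / n1
  v2 <- s2^2 / n2
  se <- sqrt(v1 + v2)
  t_stat <- (m1 - m2) / se
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))  # Welch-Satterthwaite
  list(se = se, t = t_stat, df = df, p = 2 * pt(-abs(t_stat), df))
}

m1 <- 20.22622; s1 <- 4.76605239040631   # Car SUV, as printed above
m2 <- 17.44931; s2 <- 3.43779399551164   # Truck SUV, as printed above

as_reported <- two_sample_from_summary(m1, s1, 384, m2, s2, 384)  # n = 384, as in the report
per_group   <- two_sample_from_summary(m1, s1, 48,  m2, s2, 48)   # ASSUMED 48 rows per group

print(c(se_reported = as_reported$se, se_per_group = per_group$se))
print(c(t_reported = as_reported$t, t_per_group = per_group$t))
```

Under the assumed group size the standard error is larger and the t statistic smaller than reported, although the difference in means remains significant at the 5% level.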
summary(rawdata)
## Model.Year Regulatory.Class Vehicle.Type Production.Share
## Length:384 Length:384 Length:384 1 : 47
## Class :character Class :character Class :character - : 8
## Mode :character Mode :character Mode :character 0.00002 : 1
## 0.000032: 1
## 0.000925: 1
## 0.001001: 1
## (Other) :325
## Real.World.MPG Real.World.MPG_City Real.World.MPG_Hwy Real.World.CO2..g.mi.
## Min. :10.53 Min. : 9.393 Min. :10.81 Min. :254.0
## 1st Qu.:17.03 1st Qu.:15.001 1st Qu.:19.33 1st Qu.:386.2
## Median :19.38 Median :16.898 Median :22.54 Median :458.9
## Mean :20.00 Mean :17.491 Mean :22.91 Mean :466.6
## 3rd Qu.:23.02 3rd Qu.:19.755 3rd Qu.:27.02 3rd Qu.:522.6
## Max. :33.71 Max. :29.253 Max. :38.19 Max. :844.0
##
## Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi. Weight..lbs.
## Min. :291.7 Min. :222.7 Min. :2630
## 1st Qu.:449.7 1st Qu.:328.6 1st Qu.:3536
## Median :526.5 Median :395.2 Median :3991
## Mean :528.6 Mean :410.4 Mean :3987
## 3rd Qu.:592.6 3rd Qu.:459.9 3rd Qu.:4415
## Max. :946.2 Max. :822.0 Max. :5485
##
## Horsepower..HP. Footprint..sq..ft..
## Min. : 87.81 - :264
## 1st Qu.:138.15 44.92996: 1
## Median :178.48 45.04628: 1
## Mean :183.10 45.20013: 1
## 3rd Qu.:215.07 45.21904: 1
## Max. :345.67 45.31546: 1
## (Other) :115
Mean_Real.World.MPG <- rawdata$Real.World.MPG
mean(Mean_Real.World.MPG)
## [1] 19.99682
Mean_Real.World.MPG_City <- rawdata$Real.World.MPG_City
mean(Mean_Real.World.MPG_City)
## [1] 17.49105
Mean_Real.World.MPG_Hwy <- rawdata$Real.World.MPG_Hwy
mean(Mean_Real.World.MPG_Hwy)
## [1] 22.91446
Mean_Real.World.CO2..g.mi. <- rawdata$Real.World.CO2..g.mi.
mean(Mean_Real.World.CO2..g.mi.)
## [1] 466.6172
Mean_Real.World.CO2_Hwy..g.mi. <- rawdata$Real.World.CO2_Hwy..g.mi.
mean(Mean_Real.World.CO2_Hwy..g.mi.)
## [1] 410.4031
Mean_Real.World.CO2_City..g.mi. <- rawdata$Real.World.CO2_City..g.mi.
mean(Mean_Real.World.CO2_City..g.mi.)
## [1] 528.6287
Mean_Weight..lbs.<- rawdata$Weight..lbs.
mean(Mean_Weight..lbs.)
## [1] 3987.069
Mean_Horsepower..HP. <- rawdata$Horsepower..HP.
mean(Mean_Horsepower..HP.)
## [1] 183.1011
Median_Horsepower..HP. <- rawdata$Horsepower..HP.
median(Median_Horsepower..HP.)
## [1] 178.4841
Median_Real.World.MPG <- rawdata$Real.World.MPG
median(Median_Real.World.MPG)
## [1] 19.37544
Median_Real.World.CO2..g.mi. <- rawdata$Real.World.CO2..g.mi.
median(Median_Real.World.CO2..g.mi.)
## [1] 458.931
Median_Weight..lbs <- rawdata$Weight..lbs
median(Median_Weight..lbs)
## [1] 3991.068
Median_Real.World.MPG_City <- rawdata$Real.World.MPG_City
median(Median_Real.World.MPG_City)
## [1] 16.89832
Median_Real.World.MPG_Hwy <- rawdata$Real.World.MPG_Hwy
median(Median_Real.World.MPG_Hwy)
## [1] 22.53625
Median_Real.World.CO2_City..g.mi. <- rawdata$Real.World.CO2_City..g.mi.
median(Median_Real.World.CO2_City..g.mi.)
## [1] 526.5415
Median_Real.World.CO2_Hwy..g.mi. <- rawdata$Real.World.CO2_Hwy..g.mi.
median(Median_Real.World.CO2_Hwy..g.mi.)
## [1] 395.231
range(rawdata$Real.World.CO2_Hwy..g.mi.)
## [1] 222.7413 821.9988
range(rawdata$Real.World.CO2_City..g.mi.)
## [1] 291.7420 946.1582
range(rawdata$Real.World.MPG_Hwy)
## [1] 10.81307 38.19438
range(rawdata$Real.World.MPG_City)
## [1] 9.39272 29.25306
range(rawdata$Weight..lbs.)
## [1] 2629.999 5484.824
range(rawdata$Real.World.CO2..g.mi)
## [1] 253.9547 844.0170
range(rawdata$Real.World.MPG)
## [1] 10.53097 33.71184
range(rawdata$Horsepower..HP.)
## [1] 87.8139 345.6733
sd(rawdata$Real.World.CO2_Hwy..g.mi.)
## [1] 104.3922
sd(rawdata$Real.World.CO2_City..g.mi.)
## [1] 107.2658
sd(rawdata$Real.World.MPG_Hwy)
## [1] 5.274994
sd(rawdata$Real.World.MPG_City)
## [1] 3.51826
sd(rawdata$Weight..lbs.)
## [1] 549.3396
sd(rawdata$Real.World.CO2..g.mi)
## [1] 107.4801
sd(rawdata$Real.World.MPG)
## [1] 4.374913
sd(rawdata$Horsepower..HP.)
## [1] 55.78506
var(rawdata$Real.World.CO2_Hwy..g.mi.)
## [1] 10897.72
var(rawdata$Real.World.CO2_City..g.mi.)
## [1] 11505.96
var(rawdata$Real.World.MPG_Hwy)
## [1] 27.82556
var(rawdata$Real.World.MPG_City)
## [1] 12.37815
var(rawdata$Weight..lbs.)
## [1] 301774
var(rawdata$Real.World.CO2..g.mi)
## [1] 11551.96
var(rawdata$Real.World.MPG)
## [1] 19.13986
var(rawdata$Horsepower..HP.)
## [1] 3111.973
HP <- rawdata$Horsepower..HP.
MPG <-rawdata$Real.World.MPG
World_CO2 <- rawdata$Real.World.CO2..g.mi
weight <- rawdata$Weight..lbs.
World.MPG_Hwy <- rawdata$Real.World.MPG_Hwy
World.CO2_Hwy <- rawdata$Real.World.CO2_Hwy..g.mi.
Mode <- function(x){
ux <- unique(x)
ux[which.max(tabulate(match(x,ux)))]
}
Mode(HP)
## [1] 137.3346
Mode(MPG)
## [1] 13.0597
Mode(World_CO2)
## [1] 680.5961
Mode(weight)
## [1] 4000
Mode(World.MPG_Hwy)
## [1] 14.61167
Mode(World.CO2_Hwy)
## [1] 608.3116
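Several of the modes above (137.3346, 13.0597, 680.5961, 14.61167, 608.3116) are the same values that appear in the first row of head(rawdata). That is what Mode() returns when no value repeats, because which.max() picks the first maximum. The sketch below illustrates this behaviour; max_count is a hypothetical helper and the weights with a repeat are made up.

```r
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: the highest number of times any single value occurs.
max_count <- function(x) max(table(x))

all_distinct <- c(137.3346, 136.1964, 136.2256, 142.0826)  # first HP values shown by head()
with_repeat  <- c(3900, 4000, 4100, 4000)                   # made-up weights with one repeat

print(Mode(all_distinct))      # the first element, because every value occurs once
print(max_count(all_distinct))
print(Mode(with_repeat))
```

Checking max_count() first is one way to tell whether a reported mode is meaningful for a continuous variable.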
IQR(rawdata$Real.World.MPG)
## [1] 5.991398
IQR(rawdata$Real.World.MPG_City)
## [1] 4.753562
IQR(rawdata$Real.World.MPG_Hwy)
## [1] 7.690288
IQR(rawdata$Real.World.CO2..g.mi.)
## [1] 136.4557
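IQR(rawdata$Real.World.CO2_Hwy..g.mi.) # column name corrected; the misspelled `.World.CO2_Hwy..g.mi` matched no column and returned NA
## approximately 131.3 (3rd Qu. 459.9 minus 1st Qu. 328.6 in the summary() output above)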
IQR(rawdata$Real.World.CO2_City..g.mi.)
## [1] 142.9086
IQR(rawdata$Weight..lbs.)
## [1] 878.8207
IQR(rawdata$Horsepower..HP.)
## [1] 76.91698
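The mean, median, range, standard deviation, variance, and IQR computed one call at a time above can also be collected into a single table. The sketch below shows one way to do this on a made-up two-column data frame; the helper name describe and the toy values are assumptions for illustration.

```r
# Made-up stand-in for the numeric columns of rawdata.
toy <- data.frame(
  mpg    = c(13.1, 17.0, 19.4, 23.0, 33.7),
  weight = c(4060, 3536, 3991, 4415, 2630)
)

# Hypothetical helper returning the descriptive statistics used in this report.
describe <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x),
    var = var(x), min = min(x), max = max(x), IQR = IQR(x))
}

stats_table <- t(sapply(toy, describe))   # one row per variable
print(round(stats_table, 2))
```

Applied to the real data, the same call on the numeric columns would give one row per variable and make typos in individual column names less likely.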
str(rawdata)
## 'data.frame': 384 obs. of 13 variables:
## $ Model.Year : chr "1975" "1975" "1975" "1975" ...
## $ Regulatory.Class : chr "All" "Car" "Car" "Truck" ...
## $ Vehicle.Type : chr "All" "All Car" "Sedan/Wagon" "All Truck" ...
## $ Production.Share : Factor w/ 331 levels "-","0.00002",..: 331 326 325 173 129 60 22 5 331 320 ...
## $ Real.World.MPG : num 13.1 13.5 13.5 11.6 11.9 ...
## $ Real.World.MPG_City : num 12 12.3 12.3 10.9 11.1 ...
## $ Real.World.MPG_Hwy : num 14.6 15.2 15.2 12.7 13.1 ...
## $ Real.World.CO2..g.mi. : num 681 661 660 764 746 ...
## $ Real.World.CO2_City..g.mi.: num 740 722 722 814 802 ...
## $ Real.World.CO2_Hwy..g.mi. : num 608 586 586 702 677 ...
## $ Weight..lbs. : num 4060 4057 4058 4073 4012 ...
## $ Horsepower..HP. : num 137 136 136 142 141 ...
## $ Footprint..sq..ft.. : Factor w/ 121 levels "-","44.92996",..: 1 1 1 1 1 1 1 1 1 1 ...
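Output: The output above shows five non-numeric variables: three character columns (Model.Year, Regulatory.Class, Vehicle.Type) and two factors (Production.Share and Footprint..sq..ft..), whose numeric values were read as text because of the "-" placeholder. The remaining eight variables are numeric.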
numeric_dataset <- function(Dataset){
nums <- sapply(Dataset, is.numeric)
return(Dataset[ , nums])
}
automotive_num <- numeric_dataset(rawdata)
str(automotive_num)
## 'data.frame': 384 obs. of 8 variables:
## $ Real.World.MPG : num 13.1 13.5 13.5 11.6 11.9 ...
## $ Real.World.MPG_City : num 12 12.3 12.3 10.9 11.1 ...
## $ Real.World.MPG_Hwy : num 14.6 15.2 15.2 12.7 13.1 ...
## $ Real.World.CO2..g.mi. : num 681 661 660 764 746 ...
## $ Real.World.CO2_City..g.mi.: num 740 722 722 814 802 ...
## $ Real.World.CO2_Hwy..g.mi. : num 608 586 586 702 677 ...
## $ Weight..lbs. : num 4060 4057 4058 4073 4012 ...
## $ Horsepower..HP. : num 137 136 136 142 141 ...
Output: Only the eight numeric variables remain in automotive_num.
head(automotive_num)
## Real.World.MPG Real.World.MPG_City Real.World.MPG_Hwy Real.World.CO2..g.mi.
## 1 13.05970 12.01552 14.61167 680.5961
## 2 13.45483 12.31413 15.17266 660.6374
## 3 13.45833 12.31742 15.17643 660.4660
## 4 11.63431 10.91165 12.65900 763.8613
## 5 11.91476 11.07827 13.12613 745.8814
## 6 11.10606 10.55642 11.86084 800.1940
## Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi. Weight..lbs.
## 1 739.7380 608.3116 4060.399
## 2 721.8293 585.8472 4057.494
## 3 721.6367 585.7019 4057.565
## 4 814.4506 702.0300 4072.518
## 5 802.2009 677.0464 4011.977
## 6 841.8573 749.2722 4195.690
## Horsepower..HP.
## 1 137.3346
## 2 136.1964
## 3 136.2256
## 4 142.0826
## 5 140.9365
## 6 143.2245
Output: The output above shows the first six rows of the numeric data.
aggregate_mean <- aggregate(automotive_num, by = list(rawdata$Vehicle.Type), FUN = mean)
colnames(aggregate_mean)[1] <- "Vehicle.Type"
dim(aggregate_mean)
## [1] 8 9
aggregate_mean
## Vehicle.Type Real.World.MPG Real.World.MPG_City Real.World.MPG_Hwy
## 1 All 21.05023 18.36558 24.21940
## 2 All Car 23.80026 20.58441 27.65969
## 3 All Truck 17.77338 15.69641 20.12641
## 4 Car SUV 20.22622 17.84962 22.93543
## 5 Minivan/Van 18.45708 16.06780 21.21303
## 6 Pickup 17.09726 15.14403 19.32361
## 7 Sedan/Wagon 24.12080 20.80189 28.11248
## 8 Truck SUV 17.44931 15.41869 19.72568
## Real.World.CO2..g.mi. Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi.
## 1 431.3862 492.3273 375.8517
## 2 385.4366 443.0676 332.2874
## 3 511.9602 576.1101 454.6119
## 4 465.4647 524.2578 410.7897
## 5 500.2546 566.8536 440.8516
## 6 527.1617 593.8871 468.0765
## 7 381.3337 439.1042 328.0781
## 8 529.9395 593.4221 472.6779
## Weight..lbs. Horsepower..HP.
## 1 3780.094 176.0955
## 2 3414.820 159.5193
## 3 4356.861 199.8513
## 4 3738.246 167.2514
## 5 4328.552 189.1294
## 6 4481.193 215.5737
## 7 3386.146 158.9195
## 8 4410.642 198.4687
aggregate_Standard_Deviation <- aggregate(automotive_num, by = list(rawdata$Vehicle.Type), FUN = sd)
colnames(aggregate_Standard_Deviation)[1] <- "Vehicle.Type"
head(aggregate_Standard_Deviation)
## Vehicle.Type Real.World.MPG Real.World.MPG_City Real.World.MPG_Hwy
## 1 All 2.896313 2.305560 3.427399
## 2 All Car 4.069109 3.272402 4.700826
## 3 All Truck 2.611033 2.012260 3.202051
## 4 Car SUV 4.766052 4.033952 5.377006
## 5 Minivan/Van 3.439250 2.506315 4.393650
## 6 Pickup 1.825319 1.542381 2.268045
## Real.World.CO2..g.mi. Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi.
## 1 70.33875 69.91426 66.41512
## 2 78.08117 78.64341 71.88232
## 3 83.04248 77.73747 84.51645
## 4 124.29870 131.47756 110.86445
## 5 107.65964 95.28129 112.57889
## 6 65.51111 64.48457 67.31921
## Weight..lbs. Horsepower..HP.
## 1 346.2237 49.73348
## 2 252.1981 38.06582
## 3 383.0226 60.31795
## 4 255.7871 39.51241
## 5 185.8731 47.97849
## 6 643.3320 82.74474
aggregate_var <- aggregate(automotive_num, by = list(rawdata$Vehicle.Type), FUN = var)
colnames(aggregate_var)[1] <- "Vehicle.Type"
head(aggregate_var)
## Vehicle.Type Real.World.MPG Real.World.MPG_City Real.World.MPG_Hwy
## 1 All 8.388631 5.315607 11.747064
## 2 All Car 16.557648 10.708616 22.097765
## 3 All Truck 6.817493 4.049190 10.253131
## 4 Car SUV 22.715255 16.272772 28.912195
## 5 Minivan/Van 11.828443 6.281613 19.304160
## 6 Pickup 3.331790 2.378940 5.144027
## Real.World.CO2..g.mi. Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi.
## 1 4947.539 4888.004 4410.969
## 2 6096.669 6184.786 5167.068
## 3 6896.053 6043.115 7143.031
## 4 15450.168 17286.348 12290.926
## 5 11590.599 9078.524 12674.007
## 6 4291.706 4158.260 4531.876
## Weight..lbs. Horsepower..HP.
## 1 119870.83 2473.419
## 2 63603.90 1449.007
## 3 146706.34 3638.255
## 4 65427.03 1561.230
## 5 34548.82 2301.936
## 6 413876.12 6846.692
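# Note: base R mode() returns an object's storage mode ("numeric" here), not the most frequent value; the Mode() helper defined above computes the statistical mode
aggregate_Mode <- aggregate(automotive_num, by = list(rawdata$Vehicle.Type), FUN = mode)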
colnames(aggregate_Mode)[1] <- "Vehicle.Type"
head(aggregate_Mode)
## Vehicle.Type Real.World.MPG Real.World.MPG_City Real.World.MPG_Hwy
## 1 All numeric numeric numeric
## 2 All Car numeric numeric numeric
## 3 All Truck numeric numeric numeric
## 4 Car SUV numeric numeric numeric
## 5 Minivan/Van numeric numeric numeric
## 6 Pickup numeric numeric numeric
## Real.World.CO2..g.mi. Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi.
## 1 numeric numeric numeric
## 2 numeric numeric numeric
## 3 numeric numeric numeric
## 4 numeric numeric numeric
## 5 numeric numeric numeric
## 6 numeric numeric numeric
## Weight..lbs. Horsepower..HP.
## 1 numeric numeric
## 2 numeric numeric
## 3 numeric numeric
## 4 numeric numeric
## 5 numeric numeric
## 6 numeric numeric
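Every cell in the table above reads "numeric" because base R's mode() reports the storage mode of its argument rather than the most frequent value. The sketch below contrasts mode() with a statistical Mode() helper inside aggregate(), using a small made-up data frame.

```r
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Made-up data: two vehicle types, one numeric column with repeated values.
toy <- data.frame(
  Vehicle.Type = c("Pickup", "Pickup", "Pickup", "Car SUV", "Car SUV", "Car SUV"),
  Weight       = c(4000, 4500, 4000, 3700, 3700, 3800)
)

groups <- list(Vehicle.Type = toy$Vehicle.Type)

storage_mode <- aggregate(toy["Weight"], by = groups, FUN = mode)  # storage mode of each group
stat_mode    <- aggregate(toy["Weight"], by = groups, FUN = Mode)  # most frequent value per group

print(storage_mode)
print(stat_mode)
```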
aggregate_Median <- aggregate(automotive_num, by = list(rawdata$Vehicle.Type), FUN = median)
colnames(aggregate_Median)[1] <- "Vehicle.Type"
head(aggregate_Median)
## Vehicle.Type Real.World.MPG Real.World.MPG_City Real.World.MPG_Hwy
## 1 All 20.96036 18.74020 24.21608
## 2 All Car 23.15597 19.95090 27.44609
## 3 All Truck 17.40234 15.67611 19.77885
## 4 Car SUV 19.36475 17.10980 22.23017
## 5 Minivan/Van 18.30902 16.02602 21.43195
## 6 Pickup 17.32420 15.17568 19.82763
## Real.World.CO2..g.mi. Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi.
## 1 425.0867 474.9688 367.0256
## 2 383.8861 445.8605 323.8340
## 3 511.8291 567.1100 449.4054
## 4 458.9310 519.4158 399.7855
## 5 485.3905 554.5356 414.6642
## 6 513.0584 585.7171 448.2758
## Weight..lbs. Horsepower..HP.
## 1 3896.740 175.1865
## 2 3460.863 161.9397
## 3 4407.449 194.0311
## 4 3805.244 171.6564
## 5 4373.051 181.3874
## 6 4377.310 199.1631
Output : On average, a Car SUV gets fewer miles per gallon than a Sedan/Wagon (20.2 vs. 24.1 MPG) and has higher CO2 emissions (465 vs. 381 g/mi), weight (3738 vs. 3386 lbs), and horsepower (167 vs. 159 hp). The emission, weight, and horsepower variables show much larger variances than the MPG variables, which partly reflects their larger measurement scales. The data suggest that sedans and wagons achieve better fuel economy than Car SUVs and trucks in both city and highway driving and release fewer emissions. However, further research is needed before drawing firm conclusions.
ggplot(rawdata, aes(x=Horsepower..HP., y = Real.World.MPG, group=Vehicle.Type, fill=Vehicle.Type, color=Vehicle.Type)) + stat_summary(fun = sum, na.rm=TRUE, geom='line')+ ggtitle('Vehicle efficiency against Power')+theme(axis.text.x=element_blank(),axis.ticks.x=element_blank())
Output: For Car SUVs, miles per gallon rises sharply beyond a certain point. This runs against the general pattern, in which miles per gallon declines as power and weight increase.
ggplot(rawdata, aes(x=Weight..lbs., y = Real.World.MPG, group=Vehicle.Type, fill=Vehicle.Type, color=Vehicle.Type)) + stat_summary(fun = sum, na.rm=TRUE, geom='line')+ ggtitle('Vehicle efficiency against Weight')+theme(axis.text.x=element_blank(),axis.ticks.x=element_blank())
Output : Weight has a significant impact on a Car SUV’s miles per gallon. In general, miles per gallon decreases as weight increases.
dataframe <- data.frame(
MPG = rawdata$Real.World.MPG,
CO2 = rawdata$Real.World.CO2..g.mi.,
W = rawdata$Weight..lbs.,
HP = rawdata$Horsepower..HP.
)
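par(mfrow=c(1,4)) # set up a 1x4 grid of plots
for(i in 1:ncol(dataframe)) {
  # title each histogram with its column name (the original used colnames(df), but df is not the data frame)
  hist(dataframe[,i], main=colnames(dataframe)[i], xlab="", col="pink")
}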
Output : Most of the variables are roughly bell-shaped. MPG and CO2 are right-skewed, which is consistent with their means (20.00 and 466.6) exceeding their medians (19.38 and 458.9).
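Skewness read off a histogram can be backed up with a number. The sketch below computes the moment-based sample skewness on two small made-up vectors; the helper name skewness is an assumption (base R has no built-in skewness function), and the values are illustrative only.

```r
# Moment-based sample skewness: positive means right-skewed, negative means left-skewed.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

symmetric    <- c(1, 2, 3, 4, 5)        # made-up, symmetric around 3
right_skewed <- c(1, 1, 2, 2, 3, 10)    # made-up, long right tail

print(skewness(symmetric))
print(skewness(right_skewed))
```

A positive value together with a mean above the median, as for the right-skewed vector here, matches the pattern described for MPG and CO2.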
ggplot(rawdata, aes(x=Vehicle.Type, y=Real.World.MPG)) +
geom_boxplot(fill="grey", color="blue") +
ggtitle("Fuel economy by vehicle class") +
ylab("Miles per gallon")
Output : The plot shows an average fuel economy of about 21 MPG across all vehicle categories. Sedans and wagons have higher fuel economy than Car SUVs, and the truck categories (Pickup, Truck SUV, Minivan/Van) have the lowest.
ggplot(data = rawdata) + geom_point(mapping = aes(x=Model.Year, y = Real.World.MPG, color = Vehicle.Type), position = "jitter", alpha = 0.2, show.legend = FALSE) + facet_wrap(~ Vehicle.Type)
Output: All vehicle types have become more efficient over time, and efficiency is generally on the rise. Furthermore, sedans have consistently had higher MPG over the years than the other vehicle types.
ggplot(data = rawdata) + geom_point(mapping = aes(x=Weight..lbs., y = Real.World.MPG), position = "jitter", alpha = 0.5) + geom_smooth(mapping = aes(x=Weight..lbs., y = Real.World.MPG)) + labs(x = "Weight (lbs)", y = "Miles per Gallon", title = " miles per gallon and weight correlation")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Output: According to the Environmental Protection Agency, a car’s fuel
economy improves by 1-2% for every 100 pounds removed from it.
Surprisingly, MPG first increases with vehicle weight up to about
3500 lbs. Beyond that point, efficiency drops as weight increases, as
expected.
ggplot(rawdata, aes(x=Model.Year, y=Real.World.CO2..g.mi., group=Vehicle.Type, fill=Vehicle.Type, color=Vehicle.Type)) + stat_summary(fun = sum, na.rm=TRUE, geom='line')+ ggtitle('Emission for vehicle types across years')+theme(axis.text.x=element_blank(),axis.ticks.x=element_blank()) +theme(axis.text.y=element_blank(),axis.ticks.y=element_blank())
Output : Emissions show a declining trend over the years for all vehicle classes, with sedans and wagons having the lowest levels. This suggests that automakers are developing more fuel-efficient vehicles that produce fewer emissions. The exploratory analysis also suggests that a vehicle operates more efficiently with the right balance of weight, horsepower, and additional factors such as engine condition and aerodynamics, which are not included in this dataset. Choosing a fuel-efficient car is important for reducing pollution and keeping the environment cleaner.
ggplot(data = rawdata) + geom_point(mapping = aes(x=Horsepower..HP., y = Real.World.MPG), position = "jitter", alpha = 0.1) + geom_smooth(mapping = aes(x=Horsepower..HP., y = Real.World.MPG)) + labs(x = "HP", y = "Miles per Gallon", title = "Vehicle efficiency and horsepower")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Output : Horsepower and miles per gallon have an inverse relationship, although it can also be affected by variables such as engine type, transmission, and road conditions.
ggplot(data = rawdata) + geom_point(mapping = aes(x=Horsepower..HP., y = Weight..lbs.), position = "jitter", alpha = 0.1) + geom_smooth(mapping = aes(x=Horsepower..HP., y = Weight..lbs.)) + labs(x = "HP", y = "Weight(lbs)", title = "Relation between Weight and Power")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Output : The power needed to move a vehicle is directly related to its weight: heavier vehicles need more power.
ggplot(rawdata, aes(x=Model.Year, y=Production.Share, group=Vehicle.Type, fill=Vehicle.Type, color=Vehicle.Type)) + stat_summary(fun = sum, na.rm=TRUE, geom='line')+ ggtitle('Production percentage for various vehicle kinds over time')+theme(axis.text.x=element_blank(),axis.ticks.x=element_blank()) +theme(axis.text.y=element_blank(),axis.ticks.y=element_blank())
Output : Truck and SUV production shares have increased significantly over time. Sedan production has slowly declined over the years while Car SUV production has risen, yet sedans have consistently been produced in larger volumes than Car SUVs.
ggplot(data = rawdata) + geom_point(mapping = aes(x=Real.World.CO2..g.mi., y = Real.World.MPG), position = "jitter", alpha = 0.1) + geom_smooth(mapping = aes(x=Real.World.CO2..g.mi., y = Real.World.MPG)) + labs(x = "CO2 emissions (g/mi)", y = "Miles per Gallon", title = "Fuel economy and CO2 emissions")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Output : A car with a higher mpg often uses less fuel and emits fewer pollutants, whereas a car with a lower mpg uses more fuel and emits more. They therefore have an inverse relationship.
plot(rawdata)
Output: The scatterplot matrix above shows the pairwise relationships between the variables.
ggplot(rawdata, aes(x= Real.World.MPG,y=Weight..lbs.)) + geom_smooth(method = "lm",color="pink")+ggtitle("weight vs Real.World.MPG ")
## `geom_smooth()` using formula = 'y ~ x'
Output : We can observe from the graph above that there is a negative linear relationship between Real.World.MPG and weight.
LR_Model_Fit <- lm(Weight..lbs.~Real.World.MPG , data = rawdata)
summary(LR_Model_Fit)
##
## Call:
## lm(formula = Weight..lbs. ~ Real.World.MPG, data = rawdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1368.70 -355.98 -68.04 352.95 1398.64
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4998.77 120.35 41.534 <2e-16 ***
## Real.World.MPG -50.59 5.88 -8.604 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 503.4 on 382 degrees of freedom
## Multiple R-squared: 0.1623, Adjusted R-squared: 0.1602
## F-statistic: 74.04 on 1 and 382 DF, p-value: < 2.2e-16
Output: The fitted line is Weight = 4998.77 - 50.59 * MPG. The p-values of the intercept and the slope are both below 2e-16, so neither coefficient is likely to be the result of chance. R-squared (0.1623) and adjusted R-squared (0.1602) measure how much of the variance in weight is explained by the equation. Only about 16% is explained here, which indicates a poor fit.
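In a regression with a single predictor, R-squared equals the squared Pearson correlation, and the F statistic equals the square of the slope's t statistic. The summary above is consistent with this: (-8.604)^2 is about 74.03, matching the reported F of 74.04. The sketch below checks both identities on made-up (MPG, weight) pairs, which are illustrative values only.

```r
# Made-up (MPG, weight) pairs, for illustration only.
mpg    <- c(12, 15, 18, 20, 24, 28, 33)
weight <- c(4300, 4100, 4200, 3900, 3600, 3700, 3100)

fit <- lm(weight ~ mpg)
s   <- summary(fit)

r_squared_from_cor <- cor(mpg, weight)^2
slope_t <- s$coefficients["mpg", "t value"]
f_stat  <- unname(s$fstatistic["value"])

print(c(R2 = s$r.squared, r2 = r_squared_from_cor, t_squared = slope_t^2, F = f_stat))
```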
We take alpha as 0.05 to check whether the slope is significantly different from zero.
summary_LR_Model_Fit <- summary(LR_Model_Fit) # store the model summary as an object
modelCoeffs <- summary_LR_Model_Fit$coefficients # coefficient table of the model
estimation_of_beta <- modelCoeffs["Real.World.MPG", "Estimate"] # slope estimate
standard_error <- modelCoeffs["Real.World.MPG", "Std. Error"] # standard error of the slope
value_of_t <- estimation_of_beta/standard_error # t statistic for the slope
value_of_t
## [1] -8.604374
qt(p = 0.025, df = 380)
## [1] -1.966226
Output : Since |t| = 8.60 exceeds the two-sided critical value of 1.97, we reject the null hypothesis that the slope is zero at the 5% significance level.
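The decision rule for the slope can be written out step by step. This is a minimal sketch that reuses the rounded estimate, standard error, and residual degrees of freedom printed in the summary table above; the variable names are illustrative.

```r
# Values copied (rounded) from the summary(LR_Model_Fit) table above.
beta_hat  <- -50.59  # slope estimate for Real.World.MPG
std_error <- 5.88    # standard error of the slope
resid_df  <- 382     # residual degrees of freedom reported by summary()
alpha     <- 0.05    # significance level

t_stat  <- beta_hat / std_error
t_crit  <- qt(1 - alpha / 2, df = resid_df)     # two-sided critical value
p_value <- 2 * pt(-abs(t_stat), df = resid_df)  # two-sided p-value

# Reject H0 (slope = 0) when the t statistic is beyond the critical value.
reject_null <- abs(t_stat) > t_crit
c(t_stat = t_stat, t_crit = t_crit, p_value = p_value)
reject_null
```

A two-sided test at alpha = 0.05 puts 0.025 in each tail, which is why the quantile is taken at 0.975 (or, equivalently, at 0.025 with the sign flipped).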
cbind(LR_Model_Fit$residuals, LR_Model_Fit$fitted.values) %>%
as.data.frame() %>%
ggplot(aes(y = LR_Model_Fit$residuals, x = LR_Model_Fit$fitted.values)) +
geom_point() + labs(y = "Residuals", x = "Fitted Values") + theme(text = element_text(size = 16))+geom_hline(yintercept = 0)+ggtitle("Fitted Values vs Residuals")
Output : From the chart above we can see that the points are scattered across the whole range of fitted values and are widely spread around the zero line.
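A residuals-versus-fitted check can also be built from a small data frame of the two quantities. The sketch below uses the built-in `mtcars` data as a stand-in for `rawdata` (an assumption, since the EPA file is not loaded here) and base R graphics instead of ggplot2, so it runs without extra packages.

```r
# mtcars stands in for rawdata: wt is weight in 1000 lbs, mpg is miles per gallon.
fit <- lm(wt ~ mpg, data = mtcars)

# Collect fitted values and residuals in one data frame.
diag_df <- data.frame(fitted = fitted(fit), residual = resid(fit))

# Residuals should scatter evenly around zero with no funnel or curve.
plot(diag_df$fitted, diag_df$residual,
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Fitted Values vs Residuals")
abline(h = 0)

# With an intercept in the model, least-squares residuals sum to zero.
sum(diag_df$residual)
```

A funnel shape would suggest non-constant variance, and a curved band would suggest that a transformation or an extra term is needed.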
sqrtdata<- sqrt(rawdata$Weight..lbs.)
LR_Model_Fit_1 <- lm(sqrtdata~Real.World.MPG ,data=rawdata)
summary(LR_Model_Fit_1)
##
## Call:
## lm(formula = sqrtdata ~ Real.World.MPG, data = rawdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8039 -2.7355 -0.4291 2.8432 10.2715
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.09649 0.94819 74.982 <2e-16 ***
## Real.World.MPG -0.40517 0.04632 -8.746 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.966 on 382 degrees of freedom
## Multiple R-squared: 0.1669, Adjusted R-squared: 0.1647
## F-statistic: 76.5 on 1 and 382 DF, p-value: < 2.2e-16
Output: The fitted regression equation is sqrt(Weight) = 71.09649 - 0.40517 * Real.World.MPG.
We again take alpha as 0.05.
modelSummary <- summary(LR_Model_Fit_1) # store the model summary as an object
modelCoeffs <- modelSummary$coefficients # coefficient table of the model
beta.estimate <- modelCoeffs["Real.World.MPG", "Estimate"] # slope estimate
std.error <- modelCoeffs["Real.World.MPG", "Std. Error"] # standard error of the slope
t_value <- beta.estimate/std.error # t statistic of the slope
t_value
## [1] -8.746488
qt(p = .025, df = 380)
## [1] -1.966226
Output : Since |t| = 8.75 exceeds the critical value of 1.97, we reject the null hypothesis at the 5% significance level.
cbind(LR_Model_Fit_1$residuals,LR_Model_Fit_1$fitted.values) %>%
as.data.frame() %>%
ggplot(aes(y = LR_Model_Fit_1$residuals, x = LR_Model_Fit_1$fitted.values)) +
geom_point() + labs(y = "Residuals", x = "Fitted Values",title = "Residuals Vs Fitted Values") +
theme(text = element_text(size = 16))+geom_hline(yintercept = 0)
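Output : The residuals remain widely spread around zero across the range of fitted values, so the square-root transformation does little to improve the fit (R-squared rises only from 0.1623 to 0.1669).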
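# Note: despite its name, cuberoot holds log10 of the cube root of weight, which equals log10(weight)/3
cuberoot <- log10(rawdata$Weight..lbs.^(1/3))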
LR_Model_Fit_2<- lm(cuberoot~Real.World.MPG,data=rawdata)
summary(LR_Model_Fit_2)
##
## Call:
## lm(formula = cuberoot ~ Real.World.MPG, data = rawdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.059297 -0.012435 -0.001512 0.013518 0.043848
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2365187 0.0043514 284.166 <2e-16 ***
## Real.World.MPG -0.0018838 0.0002126 -8.861 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0182 on 382 degrees of freedom
## Multiple R-squared: 0.1705, Adjusted R-squared: 0.1683
## F-statistic: 78.52 on 1 and 382 DF, p-value: < 2.2e-16
summary_LR_Model_Fit <- summary(LR_Model_Fit_2) # store the model summary as an object
coefficients_LR_Model_Fit <- summary_LR_Model_Fit$coefficients # coefficient table of LR_Model_Fit_2
estimation_of_beta <- coefficients_LR_Model_Fit["Real.World.MPG", "Estimate"] # slope estimate
standard.error <- coefficients_LR_Model_Fit["Real.World.MPG", "Std. Error"] # standard error of the slope
value_of_t <- estimation_of_beta/standard.error # t statistic of the slope
round(value_of_t, 3)
## [1] -8.861
qt(p = 0.025, df = 380)
## [1] -1.966226
Output: Since |t| = 8.861 exceeds the critical value of 1.97, we reject the null hypothesis that the slope is zero at the 5% significance level, so the regression line captures a statistically significant relationship.
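The response of this third model is computed as log10(Weight^(1/3)), which is a log transformation scaled by one third and not a plain cube root. The sketch below checks that identity on a few hypothetical weights (the numbers are made up for illustration).

```r
# Hypothetical weights in lbs, used only to illustrate the identity.
weight_lbs <- c(2500, 3200, 4100, 5000)

transformed <- log10(weight_lbs^(1/3))  # the transformation stored in `cuberoot`
scaled_log  <- log10(weight_lbs) / 3    # a log10 transform divided by 3
plain_cbrt  <- weight_lbs^(1/3)         # an actual cube root, for comparison

# The first two columns agree; the plain cube root is on a different scale.
all.equal(transformed, scaled_log)
data.frame(weight_lbs, transformed, scaled_log, plain_cbrt)
```

Because dividing by 3 only rescales the response, this model has the same t statistics and R-squared as a regression of log10(Weight) on MPG.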
ggplot(rawdata, aes(x=Real.World.MPG,y=cuberoot)) + geom_point(color= "blue") + geom_smooth(method = "lm",color="yellow")+ggtitle("Real.World.MPG VS cuberoot")
## `geom_smooth()` using formula = 'y ~ x'
cbind(LR_Model_Fit_2$residuals, LR_Model_Fit_2$fitted.values) %>%
as.data.frame() %>%
ggplot(aes(y = LR_Model_Fit_2$residuals, x = LR_Model_Fit_2$fitted.values)) +
geom_point() + labs(y = "Residuals", x = "Fitted Values",title="Residuals vs Fitted Values") +
theme(text = element_text(size = 16))+geom_hline(yintercept = 0)
#The points are dispersed along the fitted values, and the spread of the residuals changes little across them.
#Improving the regression model by using additional significant variables from the data set
LR_Model_Fit=lm(formula = Real.World.MPG ~
poly(Real.World.CO2..g.mi., 4) +
poly(Horsepower..HP.,4) +
poly(Weight..lbs., 4) ,
data = rawdata)
summary(LR_Model_Fit)
##
## Call:
## lm(formula = Real.World.MPG ~ poly(Real.World.CO2..g.mi., 4) +
## poly(Horsepower..HP., 4) + poly(Weight..lbs., 4), data = rawdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45704 -0.02449 -0.00620 0.01536 0.24209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.996818 0.003260 6133.661 < 2e-16 ***
## poly(Real.World.CO2..g.mi., 4)1 -81.923344 0.170453 -480.622 < 2e-16 ***
## poly(Real.World.CO2..g.mi., 4)2 25.043496 0.078730 318.094 < 2e-16 ***
## poly(Real.World.CO2..g.mi., 4)3 -6.514450 0.068825 -94.652 < 2e-16 ***
## poly(Real.World.CO2..g.mi., 4)4 1.240789 0.068390 18.143 < 2e-16 ***
## poly(Horsepower..HP., 4)1 -1.230356 0.253400 -4.855 1.77e-06 ***
## poly(Horsepower..HP., 4)2 -0.172138 0.098917 -1.740 0.082649 .
## poly(Horsepower..HP., 4)3 -0.004813 0.083443 -0.058 0.954035
## poly(Horsepower..HP., 4)4 0.564276 0.073859 7.640 1.87e-13 ***
## poly(Weight..lbs., 4)1 0.925597 0.269905 3.429 0.000673 ***
## poly(Weight..lbs., 4)2 0.364124 0.102364 3.557 0.000423 ***
## poly(Weight..lbs., 4)3 0.390590 0.086846 4.498 9.21e-06 ***
## poly(Weight..lbs., 4)4 -0.284783 0.069246 -4.113 4.82e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06389 on 371 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
## F-statistic: 1.496e+05 on 12 and 371 DF, p-value: < 2.2e-16
Output : From the summary above we can see that some terms are not significant (the second- and third-degree horsepower terms have p-values of 0.083 and 0.954), so we simplify the model by reducing horsepower and weight to linear terms.
LR_Model_Fit1=lm(formula = Real.World.MPG ~
poly(Real.World.CO2..g.mi.,4) +
poly(Horsepower..HP.) +
poly(Weight..lbs.) ,
data = rawdata)
summary(LR_Model_Fit1)
##
## Call:
## lm(formula = Real.World.MPG ~ poly(Real.World.CO2..g.mi., 4) +
## poly(Horsepower..HP.) + poly(Weight..lbs.), data = rawdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58635 -0.02053 -0.00846 0.01247 0.26011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.996818 0.003626 5514.140 <2e-16 ***
## poly(Real.World.CO2..g.mi., 4)1 -81.682386 0.178860 -456.683 <2e-16 ***
## poly(Real.World.CO2..g.mi., 4)2 24.885182 0.084085 295.952 <2e-16 ***
## poly(Real.World.CO2..g.mi., 4)3 -6.572361 0.073707 -89.169 <2e-16 ***
## poly(Real.World.CO2..g.mi., 4)4 1.381456 0.073231 18.864 <2e-16 ***
## poly(Horsepower..HP.) -0.293999 0.252005 -1.167 0.244
## poly(Weight..lbs.) -0.063953 0.271040 -0.236 0.814
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07106 on 377 degrees of freedom
## Multiple R-squared: 0.9997, Adjusted R-squared: 0.9997
## F-statistic: 2.419e+05 on 6 and 377 DF, p-value: < 2.2e-16
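Output : A term is statistically significant at the 5% level when its p-value is below 0.05. All four CO2 terms meet this criterion, but the linear horsepower and weight terms do not (p = 0.244 and p = 0.814), so they add little once CO2 is in the model. R-squared remains very high (0.9997), which is expected because CO2 emitted per mile is determined largely by the fuel burned per mile.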
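Whether dropping terms costs explanatory power can also be tested directly with a nested-model F-test. The sketch below is an illustration on the built-in `mtcars` data, which stands in for `rawdata`; the column choices (`disp`, `hp`, `wt`) are only analogous to the variables used above.

```r
# mtcars (built into R) stands in for rawdata in this illustration.
reduced_fit <- lm(mpg ~ poly(disp, 2), data = mtcars)
full_fit    <- lm(mpg ~ poly(disp, 2) + poly(hp, 2) + poly(wt, 2), data = mtcars)

# F-test of H0: all coefficients of the extra hp and wt terms are zero.
comparison <- anova(reduced_fit, full_fit)
comparison

# Adjusted R-squared penalises extra terms, so it is the fairer figure to compare.
c(reduced = summary(reduced_fit)$adj.r.squared,
  full    = summary(full_fit)$adj.r.squared)
```

A small p-value in the `anova()` table means the extra terms improve the fit by more than chance would explain; a large one supports the simpler model.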
new_data <- rawdata [, c(5,8,11,12)]
Data_Scaled = scale(new_data)
head(Data_Scaled)
## Real.World.MPG Real.World.CO2..g.mi. Weight..lbs. Horsepower..HP.
## [1,] -1.585659 1.990871 0.1334871 -0.8204079
## [2,] -1.495341 1.805174 0.1281990 -0.8408112
## [3,] -1.494541 1.803580 0.1283282 -0.8402878
## [4,] -1.911468 2.765575 0.1555482 -0.7352955
## [5,] -1.847364 2.598289 0.0453413 -0.7558404
## [6,] -2.032214 3.103616 0.3797665 -0.7148259
#Analyzing the data to determine its covariance and correlation
covariance_Matrix = cov(new_data)
covariance_Matrix
## Real.World.MPG Real.World.CO2..g.mi. Weight..lbs.
## Real.World.MPG 19.13986 -448.3012 -968.3456
## Real.World.CO2..g.mi. -448.30121 11551.9612 21682.5655
## Weight..lbs. -968.34562 21682.5655 301774.0027
## Horsepower..HP. 58.26358 -1579.3562 22821.8769
## Horsepower..HP.
## Real.World.MPG 58.26358
## Real.World.CO2..g.mi. -1579.35616
## Weight..lbs. 22821.87693
## Horsepower..HP. 3111.97319
correlationMatrix = cor(new_data)
correlationMatrix
## Real.World.MPG Real.World.CO2..g.mi. Weight..lbs.
## Real.World.MPG 1.0000000 -0.9533945 -0.4029211
## Real.World.CO2..g.mi. -0.9533945 1.0000000 0.3672332
## Weight..lbs. -0.4029211 0.3672332 1.0000000
## Horsepower..HP. 0.2387316 -0.2634112 0.7447192
## Horsepower..HP.
## Real.World.MPG 0.2387316
## Real.World.CO2..g.mi. -0.2634112
## Weight..lbs. 0.7447192
## Horsepower..HP. 1.0000000
transpose_of_covariance_Matrix <- t(covariance_Matrix)
multiply = covariance_Matrix%*%transpose_of_covariance_Matrix
multiply
## Real.World.MPG Real.World.CO2..g.mi. Weight..lbs.
## Real.World.MPG 1142428 -26275575 -300630704
## Real.World.CO2..g.mi. -26275575 606276793 6758100965
## Weight..lbs. -300630704 6758100965 92059458115
## Horsepower..HP. -21209007 471651146 6923769308
## Horsepower..HP.
## Real.World.MPG -21209007
## Real.World.CO2..g.mi. 471651146
## Weight..lbs. 6923769308
## Horsepower..HP. 533020204
Output : A matrix is orthogonal if multiplying it by its transpose gives the identity matrix. The product above is not an identity matrix, so the covariance matrix is not orthogonal.
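The orthogonality check can be wrapped in a small function. In this sketch `is_orthogonal()` is a hypothetical helper, and the built-in `mtcars` data stands in for `new_data`; a rotation matrix and a matrix of eigenvectors are included as examples that do pass the check.

```r
# is_orthogonal() is a hypothetical helper, not part of the original analysis.
is_orthogonal <- function(m, tol = 1e-8) {
  # A square matrix is orthogonal when m %*% t(m) equals the identity matrix.
  nrow(m) == ncol(m) &&
    isTRUE(all.equal(m %*% t(m), diag(nrow(m)),
                     tolerance = tol, check.attributes = FALSE))
}

theta <- pi / 6
rotation <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
covariance_example <- cov(mtcars[, c("mpg", "wt", "hp")])

is_orthogonal(rotation)                           # TRUE
is_orthogonal(covariance_example)                 # FALSE
is_orthogonal(eigen(covariance_example)$vectors)  # TRUE
```

The last line shows why orthogonality still matters here: the eigenvectors of a symmetric matrix such as a covariance matrix form an orthogonal matrix, even though the covariance matrix itself does not.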
eigenResidual1 = eigen(covariance_Matrix)
eigenResidual1$values
## [1] 3.050861e+05 1.105685e+04 3.126774e+02 1.419667e+00
eigenResidual1$vectors
## [,1] [,2] [,3] [,4]
## [1,] 0.003249883 -0.03610169 0.01691637 0.999199651
## [2,] -0.073064688 0.94587191 0.31486842 0.029081875
## [3,] -0.994514555 -0.04560166 -0.09408069 0.003179807
## [4,] -0.074778268 -0.31928589 0.94430956 -0.027279867
eigenResidual2 = eigen(correlationMatrix)
eigenResidual2$values
## [1] 2.20082486 1.71075588 0.05046492 0.03795434
eigenResidual2$vectors
## [,1] [,2] [,3] [,4]
## [1,] 0.65047130 -0.1631404 0.4367088 0.5996314
## [2,] -0.64405470 0.1873250 0.7024558 0.2380309
## [3,] -0.40203020 -0.6041091 -0.3977543 0.5614404
## [4,] 0.02126844 -0.7571966 0.3970300 -0.5182356
Output : Although the covariance matrix and the correlation matrix are related, they have different eigenvalues and eigenvectors, because the correlation matrix is the covariance matrix of the standardized variables. The signs of some eigenvector entries also differ between the two.
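The link between the two decompositions can be checked numerically. The sketch below again uses `mtcars` as a stand-in for `new_data` (an assumption made so that the example runs on its own). Note also that the sign of an eigenvector is arbitrary, so sign differences alone carry no meaning.

```r
vars <- mtcars[, c("mpg", "wt", "hp")]  # stand-in for new_data

eig_cov <- eigen(cov(vars))
eig_cor <- eigen(cor(vars))

# The correlation matrix is the covariance matrix of the standardised data.
all.equal(cor(vars), cov(scale(vars)))

# Correlation eigenvalues sum to the number of variables;
# covariance eigenvalues sum to the total variance.
sum(eig_cor$values)
c(sum(eig_cov$values), sum(diag(cov(vars))))

# Share of total variance carried by the first component under each scaling.
cov_share <- eig_cov$values[1] / sum(eig_cov$values)
cor_share <- eig_cor$values[1] / sum(eig_cor$values)
c(covariance = cov_share, correlation = cor_share)
```

On unscaled data the variable with the largest variance dominates the first component, which is the usual reason to run principal component analysis on the correlation matrix when variables have different units.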
Squareroot <- eigenResidual1$vectors %*% diag(sqrt(eigenResidual1$values)) %*% t(eigenResidual1$vectors)
Squareroot
## [,1] [,2] [,3] [,4]
## [1,] 1.337533 -3.593018 -1.636459 1.327816
## [2,] -3.593018 98.779114 35.076372 -23.481616
## [3,] -1.636459 35.076372 546.678104 41.036858
## [4,] 1.327816 -23.481616 41.036858 29.577020
Output : Using the spectral decomposition, we obtain the square root of the covariance matrix by multiplying the matrix of eigenvectors by the diagonal matrix of the square roots of the eigenvalues and then by the transpose of the eigenvector matrix.
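A quick way to confirm such a result is to multiply the square root by itself and compare it with the original matrix. The sketch below does this on a covariance matrix from `mtcars`, which stands in for the matrix used above.

```r
cov_matrix <- cov(mtcars[, c("mpg", "wt", "hp")])  # stand-in covariance matrix

# Spectral decomposition: cov_matrix = V %*% diag(lambda) %*% t(V)
decomp <- eigen(cov_matrix, symmetric = TRUE)

# Square root: V %*% diag(sqrt(lambda)) %*% t(V)
sqrt_matrix <- decomp$vectors %*% diag(sqrt(decomp$values)) %*% t(decomp$vectors)

# Multiplying the square root by itself should recover the covariance matrix.
all.equal(sqrt_matrix %*% sqrt_matrix, cov_matrix, check.attributes = FALSE)
```

The construction relies on all eigenvalues being non-negative, which holds for any covariance matrix.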
Percent_Variance_Explained <-eigenResidual2$values / sum(eigenResidual2$values)
Percent_Variance_Explained
## [1] 0.550206216 0.427688969 0.012616230 0.009488584
cumsum(Percent_Variance_Explained)
## [1] 0.5502062 0.9778952 0.9905114 1.0000000
plot(Percent_Variance_Explained)
plot(cumsum(Percent_Variance_Explained))
Output : The first two components together explain 97.8% of the variance, which exceeds the 90% threshold, so we keep two components.
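The choice of the number of components can be made programmatically from the eigenvalues. This is a minimal sketch that reuses the correlation-matrix eigenvalues printed above and treats 90% as the cut-off; the variable names are illustrative.

```r
# Eigenvalues of the correlation matrix, copied from the output above.
eigenvalues <- c(2.20082486, 1.71075588, 0.05046492, 0.03795434)

variance_share   <- eigenvalues / sum(eigenvalues)
cumulative_share <- cumsum(variance_share)

threshold <- 0.90  # cut-off for cumulative variance explained

# Smallest number of components whose cumulative share reaches the threshold.
n_components <- which(cumulative_share >= threshold)[1]
round(cumulative_share, 4)
n_components
```

With these eigenvalues the first component alone explains about 55% of the variance, and the threshold is first reached at the second component.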
eigen_vectors2 = eigenResidual2$vectors[,1:2]
eigen_vectors2
## [,1] [,2]
## [1,] 0.65047130 -0.1631404
## [2,] -0.64405470 0.1873250
## [3,] -0.40203020 -0.6041091
## [4,] 0.02126844 -0.7571966
colnames(eigen_vectors2) = c("pc1", "pc2")
row.names(eigen_vectors2) = colnames(new_data)
eigen_vectors2
## pc1 pc2
## Real.World.MPG 0.65047130 -0.1631404
## Real.World.CO2..g.mi. -0.64405470 0.1873250
## Weight..lbs. -0.40203020 -0.6041091
## Horsepower..HP. 0.02126844 -0.7571966
Output : On the first principal component (PC1), MPG loads positively (0.65), while real-world CO2 (-0.64) and weight (-0.40) load negatively; horsepower contributes almost nothing (0.02). PC1 therefore contrasts fuel-efficient vehicles with heavier, higher-emitting ones. The second principal component (PC2) is dominated by negative loadings for horsepower (-0.76) and weight (-0.60), with small loadings for MPG (-0.16) and CO2 (0.19), so it mainly reflects vehicle size and power.
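Once the loadings are chosen, each observation can be projected onto the components to obtain its scores. The sketch below illustrates this on `mtcars` (a stand-in for `new_data`, with `disp` added only to have four variables) and compares the result with `prcomp()`, allowing for the arbitrary sign of each component.

```r
vars   <- mtcars[, c("mpg", "wt", "hp", "disp")]  # stand-in for new_data
scaled <- scale(vars)                              # centre and standardise

# Loadings: first two eigenvectors of the correlation matrix.
loadings <- eigen(cor(vars))$vectors[, 1:2]
dimnames(loadings) <- list(colnames(vars), c("pc1", "pc2"))

# Scores: standardised data projected onto the loadings.
scores <- scaled %*% loadings
head(scores)

# prcomp() should give the same scores up to the sign of each column.
pca <- prcomp(vars, center = TRUE, scale. = TRUE)
all.equal(abs(unname(scores)), abs(unname(pca$x[, 1:2])))
```

The two score columns are uncorrelated by construction, which is what makes them convenient replacements for the original, correlated variables.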
Future work and Limitations : Treating production share and footprint as numerical variables would allow a more thorough study of how the share of each vehicle type varies across all variables.
Learnings: According to the data analysis, vehicle fuel efficiency is at an all-time high and emissions are at an all-time low. Even though SUVs are less efficient than sedans, there has been a noticeable shift toward them. Because the performance of SUVs has improved significantly over time and the comfort they offer is an acceptable trade-off for some efficiency, we can conclude that SUV sales have increased relative to sedan sales. The data also reveal how numerous factors, including emissions, weight, horsepower, footprint, and efficiency, relate to one another.
For a better understanding of the data, please see the discussion of some of the readings at the link below.
https://www.epa.gov/system/files/documents/2022-12/420s22001.pdf